A proposal to map the principles of [[Site Reliability Engineering]] (SRE) to the design and maintenance of resilient human communities and social systems.
If we view a [[community]] as a distributed system, we can apply the rigorous engineering practices used to keep high-availability systems (such as those run at [[Google]]) online to keeping our social groups healthy. The goal is not to treat people like machines, but to build systems that are resilient to human error and conflict.
In SRE, a Service Level Objective (SLO) defines the acceptable level of reliability (e.g., "99.9% of requests will succeed"). In a community, this maps to a [[Social Contract]].
In SRE, an Error Budget is the amount of unreliability the SLO permits: for a 99.9% SLO, the remaining 0.1% of requests may fail. If you have budget left, you can take risks and push code. If you burn it all, you must freeze changes until reliability recovers. In a community, this maps to a [[Forgiveness Budget]].
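The arithmetic behind an error budget can be sketched in a few lines. This is a minimal illustration of the SRE side only, not a definitive implementation: the class `ErrorBudget` and its field names are hypothetical, and what would count as an "event" or a "failure" in a community is left open here. `Fraction` is used so that 99.9% is represented exactly rather than as a rounded float.

```python
from dataclasses import dataclass
from fractions import Fraction


@dataclass
class ErrorBudget:
    """Hypothetical helper: how much unreliability an SLO still permits in one window."""
    slo: Fraction        # target success ratio, e.g. Fraction("0.999") for 99.9%
    total_events: int    # events observed in the window (requests, in SRE terms)
    failed_events: int   # events that missed the objective

    @property
    def allowed_failures(self) -> Fraction:
        # The budget is the complement of the SLO, applied to the window.
        return (1 - self.slo) * self.total_events

    @property
    def remaining(self) -> Fraction:
        # Budget left after subtracting the failures already observed.
        return self.allowed_failures - self.failed_events

    def changes_frozen(self) -> bool:
        # Once the budget is spent, risky changes stop until it recovers.
        return self.remaining <= 0


budget = ErrorBudget(slo=Fraction("0.999"), total_events=100_000, failed_events=40)
print(budget.allowed_failures, budget.remaining, budget.changes_frozen())
```

With a 99.9% SLO over 100,000 events, 100 failures are allowed; 40 observed failures leave 60 in the budget, so changes are not frozen. The same bookkeeping could in principle track a Forgiveness Budget, provided the community agrees on what it is counting.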
In SRE, when a system breaks, we declare an Incident. We assign an Incident Commander (IC). We follow a Runbook. In a community, this maps to [[Conflict Resolution]] protocols.
In SRE, after an incident, we hold a Blameless Post-Mortem. The goal is not to fire the engineer who pushed the bug, but to understand why the system allowed the bug to be pushed. In a community, this maps to [[Restorative Justice Circles]].